Overview

Under ideal conditions and recording methods, the GPS on devices like smartphones typically has an accuracy of about ±5m, even though it will display coordinates with many more digits than that justifies. Relatively minor inaccuracies in the sample locations (5-20m) could result in many miscategorizations and miscalculations when aligning with local GIS layers, particularly highly detailed layers (like the habitat layer or landuse data). I tried to characterize the amount of uncertainty, think of different ways to filter the data, and explore a couple of options for categorizing samples. Briefly:

  • The GEOPRECISION field specifies whether the coordinates were measured, extrapolated, etc. We can remove bad extrapolations.
  • Some coordinates use decimal degrees. These can be filtered to exclude those with too few decimal places (e.g., 46.62ºN 6.72ºE specifies roughly a square kilometer).
  • Assuming good GPS conditions, buffers of 5-10m can determine the composition of plausible habitats for each point.
  • The optional HABITAT field gives descriptions of where the ants were collected. These can be cross-referenced with the habitats extracted by location, under the assumption that the HABITAT field is more accurate.
  • Other land cover datasets like CORINE are less precise, and would not require as much confidence in the point locations for land cover assignments.

I also included some EDA for distances to the nearest road (and nearest road type), distances to the nearest building, and species accumulation curves using the habitat assigned with the 5m buffer (though I’m not convinced using the habitat layer from the structured samples is wise, at least for all categories).


Locational precision

GEOPRECISION

One concern with extracting local variables like habitat, land use, distance from the nearest road, or distance from the nearest building is that it requires a lot of confidence in the latitude and longitude associated with the point locations. The column GEOPRECISION indicates whether the location was extrapolated, corrected, or measured (or some combination).

Summary of geoprecision categorizations.
GEOPRECISION                            Tubes   Percent   Percent (non-NA)
mesuré                                  6159    89.5%     89.6%
extrapolé                               624     9.1%      9.1%
extrapolé/corrigé                       44      0.6%      0.6%
extrapolé mauvais                       17      0.2%      0.2%
NA                                      12      0.2%      -
mesuré/corrigé                          11      0.2%      0.2%
extrapolé (base tube précédent)         6       0.1%      0.1%
extrapolé (église par défaut)           5       0.1%      0.1%
extrapolé (gare par défaut)             4       0.1%      0.1%
extrapolé/corrigé (église par défaut)   1       0.0%      0.0%

The coordinates were mostly measured directly by the collector, and only a small proportion were extrapolated badly. In theory, we could assume that mesuré, extrapolé, extrapolé/corrigé, and mesuré/corrigé indicate that the coordinates can be used directly.

Decimal degree digits

The number of reported digits is an estimate of precision for coordinates reported in decimal degrees, but not for the Swiss coordinate system, which reports 6 digits no matter what. At the equator, an arc-degree of latitude or longitude corresponds to about 111km. At a latitude of about 46ºN, an arc-degree of longitude is about 76.5km.

Decimals Precision (Lat.) Precision (Lon.)
1 ± 5550 m ± 3825 m
2 ± 555 m ± 383 m
3 ± 55.5 m ± 38.3 m
4 ± 5.55 m ± 3.83 m
5 ± 0.555 m ± 0.383 m
6 ± 0.0555 m ± 0.0383 m
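
The values in this table are half of the last reported digit, converted to meters. Here is a minimal sketch of that arithmetic, assuming a round 111km per degree and a latitude of 46.5ºN for Vaud (both are illustrative assumptions, and `deg_precision` is a hypothetical helper name, so the longitude column differs from the table in the last digit):

```r
# Half-width of the interval implied by the last reported decimal, in meters.
# Assumptions: 111 km per degree of latitude; longitude scaled by cos(latitude).
deg_precision <- function(decimals, lat_deg = 46.5, km_per_deg = 111) {
  half_step_deg <- 0.5 * 10^(-decimals)
  m_per_deg_lat <- km_per_deg * 1000
  m_per_deg_lon <- m_per_deg_lat * cos(lat_deg * pi / 180)
  data.frame(
    decimals = decimals,
    lat_m = half_step_deg * m_per_deg_lat,
    lon_m = half_step_deg * m_per_deg_lon
  )
}

prec <- deg_precision(1:6)
print(prec)
```

Changing `lat_deg` shows how little the longitude precision varies across the canton compared with the step between decimal places.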

The reported digits can be used to set a minimum bound if, e.g., only 2 digits are reported, but typically devices will report many digits even if they are not justified. There were 3945 tubes (57.3%) reporting the coordinates in decimal degrees, with the rest using the 6-digit Swiss coordinates and no estimate of precision. The decimal degree coordinates include 686 tubes with coordinates extrapolated based on the reported locality. The reliability of the extrapolated coordinates for extracting local variables like habitat or land use type relies on a clear description of the habitat by the collector.

Lat/Lon decimal accuracy (joint coarsest).
Decimals Tubes Percent
1 1 0.0%
2 52 1.3%
3 165 4.2%
4 670 17.0%
5 669 17.0%
6 1226 31.1%
7 229 5.8%
8 933 23.7%

Typically, smartphones are accurate under good conditions to about 5m in radius, with worse performance around buildings, bridges, trees, etc. It therefore seems likely that coordinates with >5 decimal places are overestimating precision. More importantly, the 5.5% of locations with fewer than 4 decimal places should not be taken as-is with a high degree of confidence. Again, this metric isn’t available for the locations recorded with the Swiss coordinate system (2938 tubes: 43%), but it seems reasonable that the distribution of precision would be roughly similar.
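
The "joint coarsest" count can be sketched as follows. This is only an illustration: it assumes the coordinates are available as text (numeric storage silently drops trailing zeros), and `count_decimals` and the example values are hypothetical.

```r
# Count the digits after the decimal point in coordinates stored as text.
# NA input (e.g., tubes without decimal-degree coordinates) stays NA.
count_decimals <- function(x) {
  x <- as.character(x)
  has_point <- grepl(".", x, fixed = TRUE)
  frac <- ifelse(has_point, sub("^[^.]*\\.", "", x), "")
  n <- nchar(frac)
  n[is.na(x)] <- NA_integer_
  n
}

# Hypothetical example values, kept as character so trailing zeros survive
lat <- c("46.62", "46.51983", "46.5", NA)
lon <- c("6.72", "6.6323", "6.63412", NA)

# Joint precision: the coarser of the two coordinates
joint_dec <- pmin(count_decimals(lat), count_decimals(lon))
table(joint_dec, useNA = "ifany")
```

Taking the minimum of the two counts is the conservative choice, since a point is only as well located as its least precise coordinate.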

Filtering

For extracting local conditions based on point locations, it seems reasonable to buffer all points with 5-10m, with the local habitat or land use type assigned as the dominant category within the buffer. The buffer should not affect distance to nearest road, aside from reducing most distances by a uniform amount and reducing points with distances less than the buffer radius to 0m.

It is also a good idea to remove tubes with GEOPRECISION == "extrapolé mauvais" and possibly "extrapolé (base tube précédent)", "extrapolé (église par défaut)", "extrapolé (gare par défaut)", "extrapolé/corrigé (église par défaut)" as the uncertainty seems likely to be greater than 5-10m. Lastly, tubes with fewer than 3 decimals for the lat/lon coordinates should also be removed for the same reasons. This should maybe be even more strict.

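library(dplyr)  # filter(), %>%
library(sf)     # st_buffer()

# GEOPRECISION values where the uncertainty likely exceeds the 5-10m buffers
geo_exclude <- c("extrapolé mauvais", 
                 "extrapolé (base tube précédent)", 
                 "extrapolé (église par défaut)", 
                 "extrapolé (gare par défaut)", 
                 "extrapolé/corrigé (église par défaut)")
dec_thresh <- 3  # minimum lat/lon decimals to keep (removes 0-2 decimals)
pub_filt <- ant$pub %>% 
  filter(!is.na(GEOPRECISION)) %>%
  filter(!GEOPRECISION %in% geo_exclude) %>%
  # nchar() counts "46." (3 characters) or "6." (2 characters) plus decimals;
  # rows with NA lat/lon (presumably the Swiss-coordinate tubes) are kept
  filter(is.na(LATITUDE) | nchar(LATITUDE) >= (dec_thresh+3)) %>% 
  filter(is.na(LONGITUDE) | nchar(LONGITUDE) >= (dec_thresh+2)) 
# dist is in CRS units: this assumes ant$pub uses a projected CRS in meters
pub.5m <- pub_filt %>% st_buffer(dist=5)
pub.10m <- pub_filt %>% st_buffer(dist=10)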

Habitat & locational uncertainty

Habitat datasets

There are three land cover / land use datasets available:

  • Habitat layer created for the structured sampling
  • CORINE Land Cover, which has a broader legend and is consistent across Europe, but uses a minimum mapping unit of 25 ha (500mx500m)
  • Land use for largely agricultural land in Vaud in 2019

CORINE and the Opération Fourmis dataset have full coverage across Vaud, while the detailed land use dataset is mostly restricted to open canopy areas in the lower elevations (OpFo, CORINE, VD).

[Figure: coverage maps, panels OpFo, CORINE, VD]

Here is a random area within Vaud showing the differences. The grid is 1km x 1km, with the public inventory tubes shown as the small black points (with 5m and 10m buffers), and building footprints from OpenStreetMap. (OpFo, CORINE, VD).
[Figure: 1km x 1km grid maps, panels OpFo, CORINE, VD]

Zoomed in on the edge of a town (OpFo, CORINE, VD):
[Figure: town edge maps, panels OpFo, CORINE, VD]

Zoomed in close on a couple of points (OpFo, CORINE, VD):
[Figure: close-up maps of individual points, panels OpFo, CORINE, VD]

CORINE

CORINE is often used in the literature, but it is less instinctively appealing when the extent is limited to Vaud, since we have access to datasets that are more thematically and spatially accurate/precise. On the other hand, using a less precise dataset like CORINE would be more forgiving of sample location error. CORINE could also potentially be useful for defining broadly whether samples were in cities/towns since the legend includes Continuous/Discontinuous urban fabric (in center column above, purple = discontinuous urban fabric). For the other datasets, the main concern is whether the point locations are reliably precise enough to align them with the layer polygons.

Vaud land use

The land use data from the canton describes the reported usage in 2019, generally for open canopy habitats. The usage is quite detailed, with a focus on agriculture. Organic vs. conventional methods are not included. It uses 3-digit codes where the first digit indicates (roughly) 500: crops, 600: pastures, 700: permanent agriculture, 800: other?, and 900: other?. This table shows the categories with corresponding area and percent of the total.

Habitat: Points & buffers

Using the same habitat categories as the structured samples (first column of plots above), we can assign the habitat type for each tube as either the habitat at the point location, ignoring uncertainty, or the dominant habitat within the 5m or 10m buffer. Larger buffers will obviously include more habitat categories, and samples collected along roads or edges would most likely be mis-categorized since those habitat types are unlikely to have the greatest coverage within a 5m or 10m radius. Conversely, even slight inaccuracies in the coordinates would result in mis-categorization of these samples based on the point locations. Assigning a habitat to each point with any degree of confidence is not trivial.
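
The "dominant habitat within a buffer" rule can be sketched in base R. This assumes the intersection areas between each buffer and the habitat polygons have already been computed (for example with `sf::st_intersection()` and `sf::st_area()`); the column names `tube_id`, `habitat`, and `area_m2` and the function name are hypothetical.

```r
# areas: one row per (tube, habitat) piece inside a buffer, with its area.
# Column names tube_id, habitat, area_m2 are hypothetical.
dominant_habitat <- function(areas) {
  tot <- aggregate(area_m2 ~ tube_id + habitat, data = areas, FUN = sum)
  tot$share <- tot$area_m2 / ave(tot$area_m2, tot$tube_id, FUN = sum)
  tot <- tot[order(tot$tube_id, -tot$area_m2), ]
  tot[!duplicated(tot$tube_id), c("tube_id", "habitat", "share")]
}

# Toy example: tube "a" sits on a road edge, tube "b" is well inside forest
areas <- data.frame(
  tube_id = c("a", "a", "a", "b"),
  habitat = c("transport", "ForetMixe", "ForetMixe", "ForetConifere"),
  area_m2 = c(30, 25, 23.5, 78.5)
)
dom <- dominant_habitat(areas)
print(dom)
```

In the toy data, tube "a" is assigned ForetMixe even though the largest single piece is transport, which illustrates how narrow habitats lose out under a dominance rule. Keeping `share` makes it possible to flag tubes whose dominant habitat covers only a slim majority of the buffer.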

HABITAT: Cross-reference

HABITAT entries

The public samples included a field for habitat, and 3348 tubes (49.3%) include a free-form entry. However, these were not standardized, and there were 1572 unique responses. They range from extremely precise about where the ant was captured to rather general. Some seem to describe the diameter of the tree where the ant was collected.

Many of the habitats used for the OpFo structured samples are unlikely to have direct matches that would allow for unambiguous categorization. Searches for keywords could give an idea of how well the extracted habitat matches the stated habitat for the (mostly) unambiguous keywords.
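
A keyword search of this kind could look like the sketch below. The patterns are my regular-expression reading of the keywords used in the following sections (e.g., `for.t` for 'for*t'), the category labels and the function name are hypothetical, and the example descriptions are invented.

```r
# Regular expressions per target category; "." stands in for a letter that
# collectors may have typed with or without an accent.
habitat_patterns <- c(
  forest  = "for.t",
  lisiere = "lisi.re",
  road    = "rue|route",
  built   = "maison|appartement|\u00e9tage|balcon|cuisine",
  pasture = "p.turage",
  garden  = "jardin|potag"
)

# Logical matrix: one row per description, one column per category.
# NA descriptions match nothing.
match_keywords <- function(habitat, patterns) {
  hits <- vapply(patterns,
                 function(p) grepl(p, habitat, ignore.case = TRUE),
                 logical(length(habitat)))
  matrix(hits, nrow = length(habitat),
         dimnames = list(NULL, names(patterns)))
}

# Invented example descriptions
habitat <- c("Foret de sapins, bord de route", "lisiere",
             "jardin potager", NA)
hits <- match_keywords(habitat, habitat_patterns)
colSums(hits)
```

A description can match several categories (the first example matches both forest and road), so ambiguous entries are easy to list and then exclude or review by hand.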

Forest

Forests are generally large habitat polygons, and most HABITAT descriptions including the word forêt should be describing tubes collected in forest habitat. Edges, borders, and clearings can be filtered out to look at a sort of ‘best case’ scenario.

## Descriptions with 'for*t': 412

Habitat extractions for tubes with HABITAT reported as forest.
Categorie n_pt n_5m n_10m pct_pt pct_5m pct_10m
Autre 23 21 25 6.9% 6.3% 7.5%
CulturePerm 1 1 1 0.3% 0.3% 0.3%
ForetConifere 59 59 68 17.8% 17.8% 20.5%
ForetFeuillus 44 43 47 13.3% 13.0% 14.2%
ForetMixe 32 32 115 9.6% 9.6% 34.6%
lisiere 18 20 19 5.4% 6.0% 5.7%
pierrier 1 1 1 0.3% 0.3% 0.3%
transport 115 117 22 34.6% 35.2% 6.6%
zalluviale 7 7 8 2.1% 2.1% 2.4%
ZoneConstruite 30 30 26 9.0% 9.0% 7.8%
NA 2 1 NA 0.6% 0.3% NA

The high proportion of point locations and 5m buffers classified as transport could reflect that the ants were collected along a road in the forest, or that the coordinates were recorded after returning to the car.

Lisière

The a priori expectation is that the point locations should be somewhat better for narrow habitat types like lisière. I would also expect poor performance across all methods, since inaccuracy in the point location is likely to move the point outside the habitat polygon, and buffers will include more non-target habitat types.

## Descriptions with 'lisi.re': 205

Habitat extractions for tubes with HABITAT reported as lisière
Categorie n_pt n_5m n_10m pct_pt pct_5m pct_10m
Autre 80 80 94 39.0% 39.0% 45.9%
ForetConifere 12 12 14 5.9% 5.9% 6.8%
ForetFeuillus 20 19 20 9.8% 9.3% 9.8%
ForetMixe 15 12 15 7.3% 5.9% 7.3%
lisiere 33 35 19 16.1% 17.1% 9.3%
marais 3 3 3 1.5% 1.5% 1.5%
pierrier 8 8 8 3.9% 3.9% 3.9%
PrairieSeche 1 1 1 0.5% 0.5% 0.5%
transport 20 22 18 9.8% 10.7% 8.8%
ZoneConstruite 13 13 12 6.3% 6.3% 5.9%
zalluviale NA NA 1 NA NA 0.5%

The point locations and 5m buffer capture lisière about equally, but it is still only 17% of the tubes with lisi.re in the HABITAT description.

Roads

As for lisière, the a priori expectation is that the point locations should be better for transport, but with relatively poor performance across all methods. There are many descriptions in HABITAT that use the word chemin, but that’s probably used more often for trails than for actual roads.

## Descriptions with 'rue' or 'route': 105

Habitat extractions for tubes with HABITAT reported as transport
Categorie n_pt n_5m n_10m pct_pt pct_5m pct_10m
Autre 33 33 37 31.4% 31.4% 35.2%
CulturePerm 1 1 2 1.0% 1.0% 1.9%
ForetConifere 2 2 2 1.9% 1.9% 1.9%
ForetMixe 2 2 2 1.9% 1.9% 1.9%
lisiere 5 5 4 4.8% 4.8% 3.8%
transport 37 38 34 35.2% 36.2% 32.4%
ZoneConstruite 25 24 23 23.8% 22.9% 21.9%
PrairieSeche NA NA 1 NA NA 1.0%

More tubes with HABITAT descriptions including rue and route are classified as transport than any other category based on the 5m buffers, but it is still only about a third. Surprisingly, there is not much difference between the point locations and the buffers.

Zone Construite

The ZoneConstruite category should also be unambiguous.

## ZC keywords: maison|appartement|étage|balcon|cuisine
## Descriptions with keywords: 123

Habitat extractions for tubes with HABITAT entries containing a ZC keyword
Categorie n_pt n_5m n_10m pct_pt pct_5m pct_10m
Autre 4 4 4 3.3% 3.3% 3.3%
CulturePerm 3 3 3 2.4% 2.4% 2.4%
ForetFeuillus 1 1 1 0.8% 0.8% 0.8%
ForetMixe 2 2 2 1.6% 1.6% 1.6%
lisiere 4 4 3 3.3% 3.3% 2.4%
transport 8 9 7 6.5% 7.3% 5.7%
ZoneConstruite 101 100 103 82.1% 81.3% 83.7%

Generally good correspondence, with minimal differences among buffers.

Pastures

Samples collected in pastures should be classified pretty reliably as Autre or PrairieSeche. The table and figure exclude HABITAT descriptions that contain lisi*re.

## Descriptions with p*turage: 187

Habitat extractions for tubes with HABITAT entries containing p*turage
Categorie n_pt n_5m n_10m pct_pt pct_5m pct_10m
Autre 97 98 109 61.8% 62.4% 69.4%
ForetConifere 2 2 2 1.3% 1.3% 1.3%
lisiere 23 23 22 14.6% 14.6% 14.0%
PrairieSeche 12 12 12 7.6% 7.6% 7.6%
transport 18 17 6 11.5% 10.8% 3.8%
ZoneConstruite 5 5 6 3.2% 3.2% 3.8%

Tubes collected in pastures based on HABITAT description are distributed across only 6 types of extracted land cover, with most aligning with Autre. Sizable numbers mapped to lisière or transport, despite removing descriptions containing the word lisi*re.


Gardens & urban areas

HABITAT & OpFo habitats

Some of the HABITAT descriptions specify that they were collected in gardens. My expectation is that these tubes should be almost entirely categorized as ZoneConstruite, Autre, and CulturePerm.

## Descriptions with 'jardin' or 'potag*': 257

Habitat extractions for tubes with HABITAT entries containing a jardin keyword
Categorie n_pt n_5m n_10m pct_pt pct_5m pct_10m
Autre 23 23 23 8.9% 8.9% 8.9%
CulturePerm 12 12 12 4.7% 4.7% 4.7%
ForetConifere 5 5 7 1.9% 1.9% 2.7%
ForetFeuillus 3 3 3 1.2% 1.2% 1.2%
ForetMixe 2 2 2 0.8% 0.8% 0.8%
lisiere 7 7 3 2.7% 2.7% 1.2%
transport 28 28 16 10.9% 10.9% 6.2%
ZoneConstruite 177 177 191 68.9% 68.9% 74.3%

The 10m buffer places about 88% of the tubes in ZoneConstruite, Autre, or CulturePerm, compared with about 82.5% for the point locations and the 5m buffer, which include a higher percentage of transport.

Gardens & CORINE urban areas

As a first approximation, we could classify points as inside or outside urban areas using the CORINE land cover categories (1XX indicate human-dominated types). To categorize tubes as coming from gardens, there are two options: 1) use the HABITAT descriptions as above, including all tubes with jardin or potag* in the description, or 2) use the 909 Jardin potager category in the Vaud land use dataset. Unfortunately, there are no tubes that are categorized as 909 Jardin potager based on location, regardless of buffering.

There are too few tubes with CORINE classifications of 111 Continuous urban fabric, which identifies parts of the few largest cities in Vaud. The 112 Discontinuous urban fabric identifies most (but not all) towns, so the comparison of gardens between cities vs. non-cities would need to be between urban and non-urban categories.
[Figure: CORINE]

CORINE: 5m buffer (point and 10m buffer are identical)
Garden   Non-Urban   Urban
FALSE    3961        2568
TRUE     66          191
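
The share of garden tubes in each CORINE class follows directly from these counts. A short sketch, with the counts copied from the table and the object names chosen for illustration:

```r
# Counts copied from the table above (5m buffer)
garden_urban <- matrix(c(3961, 66, 2568, 191), nrow = 2,
                       dimnames = list(Garden = c("FALSE", "TRUE"),
                                       CORINE = c("Non-Urban", "Urban")))

# Share of garden tubes within each CORINE class (column-wise proportions)
garden_share <- prop.table(garden_urban, margin = 2)["TRUE", ]
round(100 * garden_share, 1)
```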

Population & density

Another possibility would be to categorize communes based on population (that’s the smallest unit I’ve found).

For the distribution of population sizes among communes, there isn’t much of a clear breakpoint aside from Lausanne.


Extracted habitat vs. Vaud

OpFo habitats

Using the habitat types from the structured samples, the public dataset clearly overrepresents ZoneConstruite and underrepresents Autre.

CORINE

Similarly with the CORINE dataset, category 112 Discontinuous urban fabric is very overrepresented, with clear underrepresentation for 211 Non-irrigated arable land and 312 Coniferous forest.

For reference:

Land use

For the land use, many crops are underrepresented, while pastures tend to be overrepresented. This is not really surprising given where people would be expected to go to collect ants.


Proximity to human structures

For each tube, we can also use the location to calculate the distance to the nearest road and/or building, and potentially what type of road it is. This could be interesting for roads, since the dataset from OpenStreetMap distinguishes everything from paths to highways.

Roads

There are 25 different identified classes of roads or paths.

Here are maps for each different type of road, reducing them to only the top 15 most extensive categories (total length ≥ 97km).

Buffering points should have minimal influence on distance to the nearest road, since the distance would be reduced by the buffer radius uniformly. The exception would be points nearer to a road than the buffer radius, which would all have a distance of 0m. The type of road nearest to the coordinates should be similarly (mostly) unaffected, though it is possible that a buffer could intersect multiple types of roads. For now, let’s ignore that and just use the point locations.
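
The effect of a buffer on these distances is just a shift with a floor at zero. A tiny sketch, with a hypothetical helper name and illustrative distances:

```r
# Distance from a circular buffer of radius r to a feature, given the
# distance d from the buffer's center point (both in meters).
buffered_distance <- function(d, r) pmax(d - r, 0)

d <- c(0, 3, 5.173, 13.663, 520.686)  # illustrative point distances (m)
d_5m <- buffered_distance(d, r = 5)
d_5m
```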

Many samples were collected near paths, residential roads, and service roads.

## Summary of distances to the nearest road (m):
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   5.173  13.663  27.592  28.864 520.686
## Within 10m: 2766 tubes = 41 % 
## Within 50m: 5852 tubes = 86 % 
## Further than 200m: 79 tubes = 1 % 
## Further than 400m: 11 tubes = 0 %

As should be expected, most points are quite close to a road or path, with 41% of samples within 10m. A small number are quite far from any trail in the dataset.

Buildings

Unlike for roads, the building dataset doesn’t include any usages or descriptions, but consists of building footprints across the whole canton.

## Summary of distances to the nearest building (m):
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   10.36   54.48  142.81  213.40 2314.13
## Within 10m: 1646 tubes = 24 % 
## Within 50m: 3277 tubes = 48 % 
## Further than 200m: 1790 tubes = 26 % 
## Further than 400m: 732 tubes = 11 %


Species accumulation curves

We could make species accumulation curves to compare across habitat types or to compare with the structured samples.
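
As a rough illustration of the idea, here is a base R sketch of a randomized accumulation curve: shuffle the tubes, count the cumulative number of distinct species, and average over permutations. It assumes one species name per tube; the function name and toy data are hypothetical. An established alternative is `specaccum()` in the vegan package, which works from a site-by-species matrix.

```r
# Mean species accumulation over random orderings of the tubes.
# species: one species name per tube (hypothetical input format).
accum_curve <- function(species, n_perm = 100) {
  n <- length(species)
  curves <- vapply(
    seq_len(n_perm),
    function(i) as.numeric(cumsum(!duplicated(species[sample.int(n)]))),
    numeric(n)
  )
  rowMeans(matrix(curves, nrow = n))
}

set.seed(1)
toy_species <- c("sp_A", "sp_A", "sp_B", "sp_A",
                 "sp_C", "sp_B", "sp_A", "sp_D")
curve <- accum_curve(toy_species)
round(curve, 2)
```

Running the function separately for each habitat category, or for the public and structured tubes, gives curves that can be compared at a common number of tubes.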

Public v. structured samples: Total

Public v. structured samples: By habitat

OpFo habitats

CORINE categories

Restricted to categories with greater than 10 tubes.

Gardens

Using the garden categorization based on HABITAT descriptions as above, and the urban / non-urban categorization based on the CORINE dataset as above, we could maybe compare gardens in and out of urban zones.

Pastures

Using pastures from the HABITAT descriptions as above, restricting to Autre and Prairie Sèche based on 5m buffers.


Summaries by habitat

OpFo habitats

Using the categorizations from the 5m buffer, sorted by nTubes:

CORINE categories

Using the categorizations from the 5m buffer, sorted by nTubes:

Land use categories

Using the categorizations from the 5m buffer, sorted by nTubes:

Structured sample comparisons

For the public inventory dataset, we’ll use the categorizations from the 5m buffers. The geolocations of the structured inventory are highly reliable since they were carefully placed to align with the habitat layer. We can thus use the point locations directly. However, this should be updated depending on the question: currently, only the detections from the structured samples are included, as if each tube were independent and as if the plots without detections did not exist. If the goal is to compare the ants, this is fine. If it is to compare sampling effort, it should be changed to use the plots instead of the tubes.

Habitat distributions

These plots compare the composition of the ant samples in each dataset to the composition of the habitats in Vaud. Thus, they show some combination of 1) discrepancy between the composition of the sampling effort compared to the composition of Vaud, and 2) discrepancy in the ant densities across habitats compared to Vaud. For the public samples, we don’t know the weight of each component, though almost certainly 1) is more important. For the structured points, we could standardize since we know the sampling effort.

OpFo

The tubes in the structured samples are qualitatively more similar to Vaud than are the public samples. The largest discrepancies seem to be that the structured samples over-represent Prairie Sèche and Autre, and under-represent Forêt Conifère and ZoneConstruite. We could test whether these differences are significant (overall \(\chi^2\) is significant).
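
One way to run such a test is a chi-squared goodness-of-fit test of the tube counts against the habitat area shares. The numbers below are invented placeholders, not results, and the test treats tubes as independent, which is a simplification here.

```r
# Hypothetical counts and area shares; replace with the real values
tubes_by_habitat <- c(Autre = 30, ForetMixe = 50, ZoneConstruite = 20)
area_share       <- c(Autre = 0.2, ForetMixe = 0.5, ZoneConstruite = 0.3)

gof <- chisq.test(x = tubes_by_habitat, p = area_share)
gof$statistic
gof$p.value

# Standardized residuals: positive = over-represented relative to area
round(gof$stdres, 2)
```

The standardized residuals indicate which categories drive an overall significant result, which is more informative than the omnibus test alone.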

CORINE

Again, the structured samples are generally more similar to Vaud than are the public samples. The biggest discrepancies between the structured samples and Vaud are over-representation of 321: Natural grasslands and 243: Agricultural land with significant areas of natural vegetation, and under-representation of 112: Discontinuous urban fabric, 211: Non-irrigated arable land, and 312: Coniferous forest. This largely matches with the OpFo categories, but with land that probably corresponds with pastures (321, 243) separated from land that probably corresponds more with crops (211). BUT I NEED TO LOOK TO SEE HOW THESE CATEGORIES ALIGN ACROSS DATASETS.

VD land use

The structured samples include more samples from crop. Samples from 611 Prairies extensive are quite over-represented. In fact, the majority of prairies and pastures seem to be over-represented (601-623). This obviously makes sense if ants are more likely to be found in pastures and prairies than cropland.


Collectors

The COLLECTORFIELDNUMBER column identifies each collector. The group field trips were entered as a single collection (e.g., COLLECTORFIELDNUMBER == "collector_formation1"). We could look at the impact of including these field trips (direct: tubes collected; indirect: proportion of participants sending in additional samples, etc), and also the impact of including a small number of experts, comparing them to the rest of the citizen science effort. DO WE KNOW THE COLLECTOR IDs FOR THE EXPERTS?

Collector summary table

The number of tubes and species per collector, sorted by the number of species collected, S:

Collector accumulation curves

We could categorize the COLLECTORFIELDNUMBER as group trip, expert, public. For now, we can call anyone who collected >20 species an expert. I imagine the names are somewhere, and maybe we also call anyone in the DEE an expert?
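
A sketch of this provisional rule in base R. It assumes group trips can be recognized by "formation" in the collector ID (as in the example ID above) and that species names are in a column called `SPECIES`; both are assumptions, and the toy data are invented.

```r
# Classify collectors by species richness S.
# Assumption: IDs containing "formation" are group trips.
classify_collectors <- function(collector, species, expert_min = 20) {
  S <- tapply(species, collector, function(s) length(unique(s)))
  id <- names(S)
  S <- as.integer(S)
  type <- ifelse(grepl("formation", id), "group trip",
                 ifelse(S > expert_min, "expert", "public"))
  out <- data.frame(collector = id, S = S, type = type,
                    stringsAsFactors = FALSE)
  out[order(-out$S), ]
}

# Invented toy data: one prolific collector, one casual, one group trip
tubes <- data.frame(
  COLLECTORFIELDNUMBER = c(rep("collector_0001", 25),
                           rep("collector_0002", 3),
                           rep("collector_formation1", 30)),
  SPECIES = c(paste0("sp", 1:25), "sp1", "sp1", "sp2", paste0("sp", 1:30)),
  stringsAsFactors = FALSE
)
collectors <- classify_collectors(tubes$COLLECTORFIELDNUMBER, tubes$SPECIES)
print(collectors)
```

The `expert_min` threshold is kept as an argument so the cutoff can be changed, or replaced by a list of known expert IDs once those are available.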

## Experts:
##    collector_0033, collector_0036, collector_0058, collector_0099, collector_0133, collector_0142, collector_0547

Collectors by date

How did the number of collectors change across the year?

New collectors by date

How did the number of new collectors change across the year?

Collection dates

The DATECOLLECTION column records the collection date reported by the collector. I’m not sure if it would be possible to look for effects of the group collecting trips or news stories. There isn’t anything obvious by day, but maybe by week… There are a couple of errors in the dates (year = 2018: excluded from plots).
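
Aggregating by week could be done along these lines. The dates are invented, and the sketch assumes DATECOLLECTION can be parsed as ISO `YYYY-MM-DD` text, which may not match the actual format.

```r
# Hypothetical collection dates as ISO text; the last one has a wrong year
date_txt <- c("2019-05-06", "2019-05-08", "2019-05-14", "2018-05-07")
dates <- as.Date(date_txt, format = "%Y-%m-%d")

# Drop dates from the erroneous year before summarizing
dates <- dates[format(dates, "%Y") != "2018"]

# Tubes per week; each level is labelled by the week's start date (Monday)
tubes_per_week <- table(cut(dates, breaks = "week"))
tubes_per_week
```

Weekly counts like these could then be lined up against the dates of group trips or news stories to look for visible jumps.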

Number of tubes

Number of species